WSJCAM0: a British English speech corpus for large vocabulary continuous speech recognition
Authors
Abstract
A significant new speech corpus of British English has been recorded at Cambridge University. Derived from the Wall Street Journal text corpus, WSJCAM0 constitutes one of the largest corpora of spoken British English currently in existence. It has been specifically designed for the construction and evaluation of speaker-independent speech recognition systems. The database consists of 140 speakers, each speaking about 110 utterances. This paper describes the motivation for the corpus, the processes undertaken in its construction, and the utilities needed as support tools. All utterance transcriptions have been verified, and a phonetic dictionary has been developed to cover the training data and evaluation tasks. Two evaluation tasks have been defined using standard 5,000-word bigram and 20,000-word trigram language models. The paper concludes with comparative results on these tasks for British and American English.
Related resources
Issues in Large Vocabulary, Multilingual Speech Recognition
In this paper we report on our activities in multilingual, speaker-independent, large vocabulary continuous speech recognition. The multilingual aspect of this work is of particular importance in Europe, where each country has its own national language. Our existing recognizer for American English and French has been ported to British English and German. It has been assessed in the context of ...
Investigation of Indian English Speech Recognition using CMU Sphinx
In recent years, research on speech recognition has devoted much attention to the automatic transcription of speech data such as broadcast news (BN), medical transcription, etc. Large Vocabulary Continuous Speech Recognition (LVCSR) systems have been developed successfully for varieties of English (American English (AE), British English (BE), etc.) and other languages, but in the case of Indian English (IE),...
Spoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting
Islamic Republic of Iran Broadcasting (IRIB), as one of the biggest broadcasting organizations, produces thousands of hours of media content daily. Accordingly, IRIB's archive is one of the richest archives in Iran, containing a huge amount of multimedia data. Monitoring this massive volume of data, and browsing and retrieval of this archive, is one of the key issues for this broadcasting...
Speaker Adaptation in Continuous Speech Recognition Using MLLR-Based MAP Estimation
A variety of methods are used for speaker adaptation in speech recognition. In some techniques, such as MAP estimation, only the models with available training data are updated. Hence, large amounts of training data are required in order to have significant recognition improvements. In some others, such as MLLR, where several general transformations are applied to model clusters, the results ar...
Specifics of Hidden Markov Model Modifications for Large Vocabulary Continuous Speech Recognition
Specifics of hidden Markov model-based speech recognition are investigated. The influence of modeling simple and context-dependent phones, of using single-Gaussian and two- and three-component Gaussian mixture probability density functions for modeling the feature distribution, and of incorporating a language model is discussed. Word recognition rates and model complexity criteria are used for evaluating suitabili...